This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
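As a rough illustration of what "text recognition as image captioning" means in practice, the sketch below decodes a character sequence from image patch features with a generic encoder-decoder; the module shapes and class names are placeholders, not the OFA/OFA-OCR implementation.

```python
# Minimal sketch: treat OCR as autoregressive caption generation over image features.
import torch
import torch.nn as nn

class CaptioningOCR(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256):
        super().__init__()
        self.patch_proj = nn.Linear(768, d_model)            # stand-in for the image encoder output
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, text_ids):
        memory = self.patch_proj(patch_feats)                 # (B, P, d) image "context"
        tgt = self.token_embed(text_ids)                      # (B, T, d) previously generated text
        T = text_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                              # next-token logits over the text vocabulary

model = CaptioningOCR()
logits = model(torch.randn(2, 196, 768), torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 8000])
```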
Generalist models, which can perform diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Although they are a promising route toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To empower multi-modal task scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference, and it also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves, on average, 95% of the performance of 15 task-finetuned models with only 16% of their parameters, showcasing the reliability of multi-modal task scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
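The sketch below illustrates the flavor of such a declarative multi-modal instruction: a one-line template with typed modality slots that a system could parse into a task plan. The slot syntax and the toy parser are assumptions for illustration, not the actual OFASys interface.

```python
# Hypothetical one-line task declaration parsed into input/output modality slots.
import re

def parse_instruction(instruction: str):
    """Split an instruction into input/output segments and typed modality slots."""
    source, target = [part.strip() for part in instruction.split("->")]
    slot_pattern = re.compile(r"\[([A-Z]+):([a-z_]+)\]")
    def segment(text):
        return {"template": text, "slots": slot_pattern.findall(text)}
    return {"inputs": segment(source), "outputs": segment(target)}

# A captioning-style task declared in one line: image in, text out.
plan = parse_instruction("[IMAGE:img] what does the image describe? -> [TEXT:caption]")
print(plan["inputs"]["slots"])   # [('IMAGE', 'img')]
print(plan["outputs"]["slots"])  # [('TEXT', 'caption')]
```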
We present a novel formulation of optimal motion planning (OMP) for robots operating in hazardous environments, called adaptive Gaussian-process based stochastic trajectory optimization (AGP-STO). It first restarts accelerated gradient descent with a re-estimated Lipschitz constant (L-reAGD) to improve computational efficiency, requiring only first-order momenta. However, it still cannot infer the global optimum of the nonconvex problem informed by the Gaussian process (GP) prior and the obstacle information. Therefore, it integrates adaptive stochastic trajectory optimization (ASTO) into the L-re-estimation process to learn the GP prior, reweighted by the importance of samples via accelerated moving averaging (AMA). Moreover, we introduce incremental optimal motion planning (iOMP) to upgrade AGP-STO to iAGP-STO, which incrementally interpolates the trajectory between previously optimized waypoints to ensure continuous-time safety. Finally, we benchmark iAGP-STO against numerical (CHOMP, TrajOpt, GPMP) and sampling-based (STOMP, RRT-Connect) methods and conduct tuning experiments on the key parameters to show how the integration of L-reAGD, ASTO, and iOMP boosts computational efficiency and reliability. Furthermore, implementations of iAGP-STO on LBR-iiwa, multi-AGV, and Rethink-Baxter platforms demonstrate its applications in manipulation, collaboration, and assistance.
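To make the L-reAGD ingredient concrete, the following toy numpy sketch runs accelerated gradient descent with a backtracking re-estimation of the Lipschitz constant and a momentum restart on a placeholder quadratic cost; it is only a schematic of the idea, not the paper's trajectory optimizer.

```python
# Accelerated gradient descent with Lipschitz re-estimation and momentum restart (toy version).
import numpy as np

def agd_restart(f, grad, x0, L0=1.0, iters=100, growth=2.0):
    x, y, L, t = x0.copy(), x0.copy(), L0, 1.0
    for _ in range(iters):
        g = grad(y)
        # Backtracking re-estimation of the Lipschitz constant L.
        while True:
            x_next = y - g / L
            d = x_next - y
            if f(x_next) <= f(y) + g @ d + 0.5 * L * (d @ d):
                break
            L *= growth
        # Restart the momentum sequence if the objective did not decrease.
        if f(x_next) > f(x):
            y, t = x.copy(), 1.0
            continue
        t_next = 0.5 * (1 + np.sqrt(1 + 4 * t * t))
        y = x_next + (t - 1) / t_next * (x_next - x)   # Nesterov extrapolation
        x, t = x_next, t_next
    return x

# Toy quadratic objective standing in for a trajectory cost.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
xs = agd_restart(f, lambda x: A @ x, np.array([5.0, 5.0]))
print(np.round(xs, 6))  # approaches the minimizer [0, 0]
```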
This paper introduces a novel motion planner, incrementally stochastic and accelerated gradient information mixed optimization (iSAGO), for robotic manipulators in narrow workspaces. First, we propose the overall scheme of iSAGO, informed by mixed momenta, for efficient constrained optimization based on the penalty method. In the stochastic part, we generate adaptive stochastic momenta via random selection of sub-functionals based on the adaptive momentum (Adam) method. Because the stochastic part converges slowly, we integrate accelerated gradient descent (AGD) to improve planning efficiency. Moreover, we adopt Bayesian tree inference (BTI) to transform whole-trajectory optimization (SAGO) into incremental sub-trajectory optimization (iSAGO), which further improves computational efficiency and the success rate. Finally, we tune the key parameters and benchmark iSAGO against five other planners on an LBR-iiwa in a bookshelf scenario and an AUBO-i5 on a storage shelf. The results show that iSAGO achieves the highest success rate with moderate solving efficiency.
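The stochastic part can be pictured as an Adam-style update driven by a random subset of sub-functionals at each iteration, as in the toy numpy sketch below; the sub-costs and hyperparameters are placeholders, not the iSAGO planner.

```python
# Adam-style momenta over randomly selected sub-functionals (toy version).
import numpy as np

rng = np.random.default_rng(0)

def adam_over_subfunctionals(sub_grads, x0, steps=500, lr=0.05,
                             beta1=0.9, beta2=0.999, eps=1e-8, batch=2):
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for k in range(1, steps + 1):
        picked = rng.choice(len(sub_grads), size=batch, replace=False)
        g = sum(sub_grads[i](x) for i in picked) / batch     # stochastic gradient
        m = beta1 * m + (1 - beta1) * g                      # 1st momentum
        v = beta2 * v + (1 - beta2) * g * g                  # 2nd momentum
        m_hat, v_hat = m / (1 - beta1**k), v / (1 - beta2**k)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)             # Adam step
    return x

# Four quadratic sub-costs (e.g., per-waypoint terms) whose sum is minimized at the origin.
targets = [np.array([1.0, -1.0]), np.array([-1.0, 1.0]),
           np.array([2.0, 0.0]), np.array([-2.0, 0.0])]
sub_grads = [(lambda t: (lambda x: x - t))(t) for t in targets]
# The iterate hovers near [0, 0], the minimizer of the summed sub-costs.
print(np.round(adam_over_subfunctionals(sub_grads, np.array([3.0, 3.0])), 3))
```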
Motion blur from fast-moving subjects is a longstanding problem in photography and is very common on mobile phones due to their limited light collection efficiency, especially in low-light conditions. While we have witnessed great progress in image deblurring in recent years, most methods require significant computational power and have limitations in processing high-resolution photos with severe local motion. To this end, we develop a novel face deblurring system for mobile phones based on dual-camera fusion. The system detects subject motion to dynamically enable a reference camera, e.g., the ultrawide-angle camera commonly available on recent premium phones, and captures an auxiliary photo with a faster shutter setting. While the main shot is low-noise but blurry, the reference shot is sharp but noisy. We learn ML models to align and fuse these two shots and output a clear photo without motion blur. Our algorithm runs efficiently on Google Pixel 6, taking 463 ms of overhead per shot. Our experiments demonstrate the advantage and robustness of our system against alternative single-image, multi-frame, face-specific, and video deblurring algorithms as well as commercial products. To the best of our knowledge, our work is the first mobile solution for face motion deblurring that works reliably on thousands of images under diverse motion and lighting conditions.
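A schematic of the capture-and-fuse control flow might look like the sketch below; the motion threshold, shutter ratio, and the toy pixel-wise fusion are illustrative assumptions standing in for the learned alignment and fusion models.

```python
# Toy sketch of the dual-camera capture decision and a placeholder fusion step.
import numpy as np

MOTION_THRESHOLD = 0.3   # assumed normalized motion score above which blur is likely
SHUTTER_RATIO = 4.0      # reference shot uses a ~4x faster shutter (assumption)

def plan_capture(motion_score, main_shutter_s):
    """Decide whether to trigger the reference camera and with what shutter."""
    if motion_score > MOTION_THRESHOLD:
        return {"reference_enabled": True,
                "reference_shutter_s": main_shutter_s / SHUTTER_RATIO}
    return {"reference_enabled": False, "reference_shutter_s": None}

def fuse(main_shot, reference_shot, blur_weight):
    """Toy pixel-wise blend: lean on the sharp reference where blur is high.

    The real system uses learned alignment and fusion models; this stand-in only
    illustrates the low-noise/blurry vs. sharp/noisy trade-off.
    """
    return (1.0 - blur_weight) * main_shot + blur_weight * reference_shot

main = np.full((4, 4), 0.5)                      # blurry but clean
reference = 0.5 + 0.05 * np.random.randn(4, 4)   # sharp but noisy
print(plan_capture(motion_score=0.6, main_shutter_s=1 / 30))
print(fuse(main, reference, blur_weight=0.7).shape)
```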
Vlogs and selfies are popular social media formats that are often captured by wide-angle cameras to show both the human subject and an expanded background. Unfortunately, due to perspective projection, faces near the corners and edges exhibit apparent distortions that stretch and squish the facial features, resulting in poor video quality. In this work, we present a video warping algorithm to correct these distortions. Our key idea is to apply stereographic projection locally on the facial regions. We formulate the mesh warping problem via spatial-temporal energy minimization and minimize background deformation with a line-preservation term to maintain straight edges in the background. To address temporal coherency, we constrain the temporal smoothness of the warping meshes and the face trajectories through latent variables. For performance evaluation, we build a wide-angle video dataset covering a wide range of focal lengths. A user study shows that 83.9% of users prefer our algorithm over alternatives based on perspective projection.
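The projection change at the heart of the method can be written down directly: perspective projection maps a ray at field angle θ to radius f·tan(θ), whereas stereographic projection maps it to 2f·tan(θ/2), which grows more gently toward the edges and so preserves facial shapes better. The face-weight blending in the sketch below is a toy stand-in for the paper's mesh-warping energy minimization.

```python
# Perspective vs. stereographic radial mapping, blended by a face-region weight (toy version).
import numpy as np

def perspective_radius(theta, f=1.0):
    return f * np.tan(theta)

def stereographic_radius(theta, f=1.0):
    return 2.0 * f * np.tan(theta / 2.0)

def blended_radius(theta, face_weight, f=1.0):
    """Use stereographic projection inside face regions (weight ~1), perspective elsewhere."""
    return (face_weight * stereographic_radius(theta, f)
            + (1.0 - face_weight) * perspective_radius(theta, f))

thetas = np.deg2rad([0.0, 20.0, 40.0, 50.0])
print(np.round(perspective_radius(thetas), 3))    # stretches rapidly toward the edges
print(np.round(stereographic_radius(thetas), 3))  # grows more gently, less facial stretching
print(np.round(blended_radius(thetas, face_weight=1.0), 3))
```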
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot benefit, or benefit only marginally, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; and 3) weak regularization is preferred. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, setting a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way to develop small vision Transformer models, namely by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
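A minimal sketch of relation-based distillation, assuming plain cosine-similarity token relations and a KL matching loss (a simplification, not TinyMIM's exact recipe), is shown below; note that the student and teacher can have different feature widths because only the N x N relation maps are compared.

```python
# Match the student's token-to-token relation map against an intermediate teacher layer.
import torch
import torch.nn.functional as F

def token_relation(feats, temperature=0.1):
    """Row-wise log-softmax over cosine similarities between all token pairs."""
    feats = F.normalize(feats, dim=-1)                 # (B, N, C)
    sim = feats @ feats.transpose(1, 2) / temperature  # (B, N, N)
    return F.log_softmax(sim, dim=-1)

def relation_distill_loss(student_feats, teacher_feats):
    log_p_student = token_relation(student_feats)
    p_teacher = token_relation(teacher_feats).exp()
    # KL(teacher || student), summed over token pairs and divided by the batch size.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

student = torch.randn(2, 197, 192)   # e.g. ViT-Tiny tokens
teacher = torch.randn(2, 197, 768)   # e.g. an intermediate layer of a larger teacher
print(relation_distill_loss(student, teacher).item())
```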
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
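The token-level fusion idea can be sketched as follows: image tokens and point tokens are projected to a shared width, tagged with encodings of 3D points, concatenated, and decoded by a set of object queries into box parameters. Dimensions, heads, and the box parameterization are illustrative assumptions, not the released CMT code.

```python
# Cross-modal token fusion with query-based 3D box decoding (toy version).
import torch
import torch.nn as nn

class CrossModalDetectorSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.img_proj = nn.Linear(256, d_model)      # camera feature tokens
        self.pts_proj = nn.Linear(128, d_model)      # LiDAR/point feature tokens
        self.pos_mlp = nn.Linear(3, d_model)         # encodes 3D points into both modalities
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.box_head = nn.Linear(d_model, 10)       # assumed box parameterization

    def forward(self, img_tokens, img_pts, lidar_tokens, lidar_pts):
        img = self.img_proj(img_tokens) + self.pos_mlp(img_pts)
        pts = self.pts_proj(lidar_tokens) + self.pos_mlp(lidar_pts)
        memory = torch.cat([img, pts], dim=1)        # implicit cross-modal alignment
        q = self.queries.unsqueeze(0).expand(img.size(0), -1, -1)
        return self.box_head(self.decoder(q, memory))

model = CrossModalDetectorSketch()
boxes = model(torch.randn(1, 300, 256), torch.randn(1, 300, 3),
              torch.randn(1, 500, 128), torch.randn(1, 500, 3))
print(boxes.shape)  # torch.Size([1, 100, 10])
```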
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
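A toy version of the NAIVEATTACK step (stamping a trigger patch and a target label onto a fraction of the raw data before distillation) is sketched below; the patch size, location, and poison rate are illustrative, and DOORPING's iterative trigger updates are not shown.

```python
# Inject a corner-patch trigger into a fraction of the raw data before distillation (toy version).
import numpy as np

rng = np.random.default_rng(0)

def poison_before_distillation(images, labels, target_label,
                               poison_rate=0.05, patch=3, value=1.0):
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -patch:, -patch:] = value          # white square trigger in the corner
    labels[idx] = target_label                     # relabel to the attacker's target class
    return images, labels, idx

x = rng.random((1000, 32, 32))                     # grayscale stand-in dataset
y = rng.integers(0, 10, size=1000)
x_p, y_p, poisoned = poison_before_distillation(x, y, target_label=0)
print(len(poisoned), x_p[poisoned[0], -3:, -3:])   # 50 poisoned samples, trigger visible
# The poisoned set (x_p, y_p) would then be fed to the dataset distillation algorithm.
```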
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortions and the variation of image content, which complicate distortion patterns across different scales and aggravate the difficulty of the regression problem in BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies that help the regression model perform better. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression task in line with the easy-to-hard progression of human learning. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets; the experimental results indicate that PMT-IQA outperforms the comparison approaches and that both the MS and PMT modules improve the model's performance.
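A minimal sketch of the two named ingredients, assuming global-pooled multi-scale features for MS and an easy-to-hard weight schedule between a coarse classification head and the score regression head for PMT, is given below; the heads and schedule are assumptions rather than the paper's exact design.

```python
# Multi-scale feature fusion plus a progressive (easy-to-hard) multi-task loss (toy version).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PMTIQASketch(nn.Module):
    def __init__(self, channels=(64, 128, 256), levels=5):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(c, 64) for c in channels])
        self.score_head = nn.Linear(64 * len(channels), 1)        # hard task: quality score
        self.level_head = nn.Linear(64 * len(channels), levels)   # easy task: coarse quality level

    def forward(self, multi_scale_feats):
        pooled = [p(f.mean(dim=(2, 3))) for p, f in zip(self.proj, multi_scale_feats)]
        fused = torch.cat(pooled, dim=1)                          # multi-scale fusion (MS)
        return self.score_head(fused).squeeze(1), self.level_head(fused)

def progressive_loss(score, level_logits, mos, level, epoch, total_epochs):
    easy_w = max(0.0, 1.0 - epoch / total_epochs)                 # easy task fades out over training
    return ((1.0 - easy_w) * F.mse_loss(score, mos)
            + easy_w * F.cross_entropy(level_logits, level))

feats = [torch.randn(4, c, s, s) for c, s in [(64, 56), (128, 28), (256, 14)]]
model = PMTIQASketch()
score, level_logits = model(feats)
loss = progressive_loss(score, level_logits, torch.rand(4), torch.randint(0, 5, (4,)),
                        epoch=1, total_epochs=10)
print(loss.item())
```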